When new employees join an organisation, it can be difficult to decide which pay scale they should fall under, and settling this often requires time-consuming negotiation. The organisation needs a way to decide which salary boundary its new hires belong to without the time and cost of negotiation.
To address this problem, a machine learning classification model is used. The model predicts a target feature, salary, which places each employee in one of two bands: above 50K or at or below 50K. The model is trained on an employee salary dataset containing employment-related information that affects salaries.
The organisation can then provide the trained model with the relevant data for a new hire, based on the features in the dataset, and obtain the salary boundary that employee should fall under.
# Importing libraries
# Data processing,
import numpy as np
import pandas as pd
# Data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Imputation
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')
# External Graphing Imports
import plotly.express as px
# Modelling Imports
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score
from sklearn.preprocessing import LabelEncoder
#Individual Module Files
from modules.PlotlyGraphsModule_KS_29007147 import PlotlyGraphs_KS_29007147
from modules.CheckingDuplicatesModule_HS_29012930 import CheckingDuplicateValues_HS_29012930
from modules.scale_the_data_RG_29014027 import scale_the_data
from modules.RemoveCorrelatedInputs_AS_29020256 import remove_correlated_inputs
from modules.Normalisation_AS_29020256 import normalisation
from modules.SplitGraphs_RR_29003671 import perform_eda
from modules.Graphs_RR_29003671 import Graphs
%pip install plotly
Extraction was done by Barry Becker from the 1994 Census database.
https://www.kaggle.com/datasets/ayessa/salary-prediction-classification
# Loading file
df = pd.read_csv("./salary.csv")
df.head()
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
Information about the dataset.
# Obtaining information about the dataset's rows and columns, to check the number of records and features.
total_rows = df.shape[0]
total_columns = df.shape[1]
missing_values = sum(df.isna().sum())
duplicate_values = df.duplicated().sum()
data_types = df.dtypes
print(f"Total number of instances(rows) in dataset: {total_rows}")
print(f"Total number of features(columns) in dataset: {total_columns}")
print(f"Missing Values: {missing_values}")
print(f"Duplicate Values: {duplicate_values}\n")
data_types
Total number of instances(rows) in dataset: 32561
Total number of features(columns) in dataset: 15
Missing Values: 0
Duplicate Values: 24
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
salary            object
dtype: object
We check the unique values that exist in each feature of the dataset. Datasets sometimes contain unusual placeholder values, such as the question marks seen here, which effectively count as missing data. This check reveals anything unusual that may need cleaning during pre-processing.
# Identifying categorical features and their associated labels
for col in df.columns:
    if df[col].dtype == 'object':
        print()
        print(col)
        print(df[col].unique())
workclass
[' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']

education
[' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']

marital-status
[' Never-married' ' Married-civ-spouse' ' Divorced' ' Married-spouse-absent'
 ' Separated' ' Married-AF-spouse' ' Widowed']

occupation
[' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?'
 ' Protective-serv' ' Armed-Forces' ' Priv-house-serv']

relationship
[' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']

race
[' White' ' Black' ' Asian-Pac-Islander' ' Amer-Indian-Eskimo' ' Other']

sex
[' Male' ' Female']

native-country
[' United-States' ' Cuba' ' Jamaica' ' India' ' ?' ' Mexico' ' South'
 ' Puerto-Rico' ' Honduras' ' England' ' Canada' ' Germany' ' Iran'
 ' Philippines' ' Italy' ' Poland' ' Columbia' ' Cambodia' ' Thailand'
 ' Ecuador' ' Laos' ' Taiwan' ' Haiti' ' Portugal' ' Dominican-Republic'
 ' El-Salvador' ' France' ' Guatemala' ' China' ' Japan' ' Yugoslavia'
 ' Peru' ' Outlying-US(Guam-USVI-etc)' ' Scotland' ' Trinadad&Tobago'
 ' Greece' ' Nicaragua' ' Vietnam' ' Hong' ' Ireland' ' Hungary'
 ' Holand-Netherlands']

salary
[' <=50K' ' >50K']
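As an aside, these ' ?' placeholders can also be handled at load time; a sketch using a small inline sample in the same layout as salary.csv (the inline data is an assumption for illustration):

```python
import io
import pandas as pd

# A tiny sample in the same layout as salary.csv: values carry a
# leading space and '?' marks missing entries.
raw = "age, workclass, salary\n39, State-gov, <=50K\n50, ?, >50K\n"

# skipinitialspace strips the space after each comma, so '?' matches
# na_values and becomes NaN instead of the literal string ' ?'.
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True, na_values="?")

print(df["workclass"].tolist())
```

Loading this way removes the need for the later strip/replace pass, although here we keep the explicit cleaning steps so each transformation is visible.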
The number of unique values per feature is also checked, to confirm that each feature takes more than one value. If a feature had only one possible value across all rows, it would carry no information, have no effect on model performance, and could be dropped to save model capacity and resources. As the summary below shows, this is not the case for our data.
def summarising(df):  # takes the dataframe for the salary classification problem
    summarising_df = pd.DataFrame({  # new dataframe to store the required info
        'uniques': df.nunique()  # number of unique values per column
    })
    return summarising_df
summarising(df)
| uniques | |
|---|---|
| age | 73 |
| workclass | 9 |
| fnlwgt | 21648 |
| education | 16 |
| education-num | 16 |
| marital-status | 7 |
| occupation | 15 |
| relationship | 6 |
| race | 5 |
| sex | 2 |
| capital-gain | 119 |
| capital-loss | 92 |
| hours-per-week | 94 |
| native-country | 42 |
| salary | 2 |
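The single-value check described above can be sketched in one line; the constant column here is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50, 38],
    "constant": [1, 1, 1],   # hypothetical single-valued feature
    "salary": ["<=50K", ">50K", "<=50K"],
})

# Keep only the columns that take more than one unique value.
df = df.loc[:, df.nunique() > 1]
print(list(df.columns))
```

In our dataset every feature has at least two unique values, so this filter would drop nothing.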
Another pre-processing step is to check whether there are missing values in the dataset. Even if none are expected, this is still an important check to make before modelling.
CheckingDuplicateValues_HS_29012930(df)
Missing Values: 0
Duplicate Values: 24
The external library used is Plotly, for creating graphs and visualisations. It provides tools for scatter plots, line plots, bar charts, pie charts and various other visualisations as required. Here, Plotly was used to analyse how salaries were distributed between genders.
Gender Distribution:
The dataset is skewed towards males, with 21,790 male individuals compared to 10,771 females.
This gender bias may impact the accuracy of any salary-related predictions.
Salary Distribution:
The majority of individuals (24,720) have salaries at or below 50,000. Only 7,841 individuals earn more than 50,000. The dataset is imbalanced in favor of lower salaries.
Gender-Salary Relationship:
The two bar charts showing salary distribution for males and females both exhibit male dominance.
However, this could be influenced by the overall gender distribution in the dataset.
In summary, the dataset contains more male data points, and most individuals earn salaries below 50,000. Keep these factors in mind when interpreting any predictions or conclusions based on this data.
PlotlyGraphs_KS_29007147(df)
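The imbalance described above can be quantified directly from the reported counts:

```python
import pandas as pd

# Class counts reported for the salary target.
counts = pd.Series({"<=50K": 24720, ">50K": 7841})

# Share of each class in the dataset.
proportions = counts / counts.sum()
print(proportions.round(3))  # roughly 76% at/below 50K, 24% above
```

A roughly 76/24 split means a classifier that always predicts "<=50K" already scores about 76% accuracy, so accuracy alone is a weak metric here; the classification reports later in the notebook also include precision and recall for this reason.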
A correlation matrix is a table that shows the strength and direction of the linear relationship between pairs of variables in a dataset.
df_copy = df.copy()
labelencoder = LabelEncoder()
for column in df.columns:
    df_copy[column] = labelencoder.fit_transform(df_copy[column])
df_copy.head()
correlation_matrix = df_copy.corr().abs()
correlation_matrix
plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
Duplicate rows can have a negative influence on the dataset, so the code below drops them. The total number of rows removed is 24, less than 1% of the dataset, so the number of duplicates was comparatively small. Dropping them leaves 32,537 rows to train the model, which is still sufficient for making accurate predictions.
## Let's get rid of duplicate entries
df.drop_duplicates(keep='first',inplace=True)
rows = len(df.index)
columns = len(df.columns)
duplicate_values = df.duplicated().sum()
print(f"Rows: {rows}")
print(f"Columns: {columns}")
print(f"Duplicate Values: {duplicate_values}")
Rows: 32537
Columns: 15
Duplicate Values: 0
# Removing extra spaces
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
# Replaces ? symbols with NaN
df.replace('?', np.nan, inplace=True)
# Counting Nan values
print("Number of Nan Values:")
pd.isna(df).sum()[pd.isna(df).sum() > 0]
Number of Nan Values:
workclass         1836
occupation        1843
native-country     582
dtype: int64
The counts show that missingness in workclass and occupation is correlated: when one is missing, the other is usually missing as well.
Mode imputation was employed to address missing values in the categorical features 'workclass', 'occupation' and 'native-country'. This technique replaces missing entries with the most frequent value within each feature. Ensuring all features have complete data helps machine learning algorithms learn more effectively from the patterns within the dataset.
# Mode Imputation to deal with NaN values in categorical features
# Create a SimpleImputer object using the 'most_frequent' strategy
imputer = SimpleImputer(strategy='most_frequent')
# Define the features to be imputed (categorical in this case)
features_to_impute = ['workclass', 'occupation', 'native-country']
# Select the features with missing values; here the full dataset is both fitted and transformed
X_missing = df[features_to_impute]
# Fit the imputer to learn the most frequent value in each feature
imputer.fit(X_missing)
# Replace missing entries with the learned most frequent values
X_imputed = imputer.transform(X_missing)
# Create a copy of the original DataFrame to avoid modifying the original data
df_updated = df.copy()
# Replace the missing values in the specified features with the imputed values
df_updated[features_to_impute] = X_imputed
# Print information about the updated DataFrame
rows = len(df_updated.index)
columns = len(df_updated.columns)
names_types = df_updated.dtypes
missing_values = sum(df_updated.isna().sum())
duplicate_values = df_updated.duplicated().sum()
print(f"Rows: {rows}")
print(f"Columns: {columns}")
print(f"Missing Values: {missing_values}")
print(f"Duplicate Values: {duplicate_values}\n")
print(names_types)
# Display the first 5 rows of the updated DataFrame
df_updated.head(5)
Rows: 32537
Columns: 15
Missing Values: 0
Duplicate Values: 0

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
salary            object
dtype: object
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
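As a lighter-weight sketch of the same idea, mode imputation can also be written in plain pandas when no separate train/test fit is needed (the column here is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"workclass": ["Private", "Private", np.nan, "State-gov"]})

# Replace NaN with the most frequent value (the mode) of the column.
df["workclass"] = df["workclass"].fillna(df["workclass"].mode()[0])
print(df["workclass"].tolist())
```

SimpleImputer remains preferable in a pipeline, since the fitted imputer can later apply the training-set mode to unseen data.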
The code block below renames category labels to shorter, more understandable names and groups related labels together (for example, all government work classes become 'Govt.'), which reduces complexity.
# Renaming Labels
df_updated.replace({'workclass': {'State-gov': 'Govt.', 'Self-emp-not-inc': 'self_emp', 'Federal-gov': 'Govt.', 'Local-gov': 'Govt.', 'Self-emp-inc':'self_emp', 'Without-pay': 'Broke', 'Never-worked': 'Broke'}}, inplace=True)
df_updated.replace({'marital-status': {'Married-civ-spouse': 'Married', 'Divorced': 'DASW', 'Married-spouse-absent': 'DASW', 'Separated': 'DASW', 'Married-AF-spouse':'Married', 'Widowed': 'DASW'}}, inplace=True)
df_updated.replace({'occupation': {'Adm-clerical': 'Adminstration', 'Exec-managerial': 'Executive', 'Handlers-cleaners': 'Handlers', 'Prof-specialty': 'Professionals', 'Other-service' : 'Other', 'Craft-repair' : 'Repairing', 'Farming-fishing' : 'Farming', 'Transport-moving':'Transportation', 'Machine-op-inspct': 'MachineOp', 'Protective-serv' : 'ProtectiveServ', 'Priv-house-serv': 'HouseServ'}}, inplace=True)
df_updated.replace({'native-country': {'United-States': 'USA', 'South': 'SouthKorea', 'Puerto-Rico': 'PuertoRico', 'Dominican-Republic': 'DominicRep', 'Outlying-US(Guam-USVI-etc)':'OutlyingUSA', 'Trinadad&Tobago': 'Tri&Tob', 'Holand-Netherlands': 'Netherlands', 'Hong' : 'HongKong'}}, inplace=True)
df_updated.replace({'race': {'Asian-Pac-Islander': 'APAC', 'Amer-Indian-Eskimo': 'NatAm'}}, inplace=True)
# Checking Labels for each feature
for col in df_updated.columns:
    if df_updated[col].dtype == 'object':
        print()
        print(col)
        print(df_updated[col].unique())
workclass
['Govt.' 'self_emp' 'Private' 'Broke']

education
['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th']

marital-status
['Never-married' 'Married' 'DASW']

occupation
['Adminstration' 'Executive' 'Handlers' 'Professionals' 'Other' 'Sales'
 'Repairing' 'Transportation' 'Farming' 'MachineOp' 'Tech-support'
 'ProtectiveServ' 'Armed-Forces' 'HouseServ']

relationship
['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative']

race
['White' 'Black' 'APAC' 'NatAm' 'Other']

sex
['Male' 'Female']

native-country
['USA' 'Cuba' 'Jamaica' 'India' 'Mexico' 'SouthKorea' 'PuertoRico'
 'Honduras' 'England' 'Canada' 'Germany' 'Iran' 'Philippines' 'Italy'
 'Poland' 'Columbia' 'Cambodia' 'Thailand' 'Ecuador' 'Laos' 'Taiwan'
 'Haiti' 'Portugal' 'DominicRep' 'El-Salvador' 'France' 'Guatemala'
 'China' 'Japan' 'Yugoslavia' 'Peru' 'OutlyingUSA' 'Scotland' 'Tri&Tob'
 'Greece' 'Nicaragua' 'Vietnam' 'HongKong' 'Ireland' 'Hungary'
 'Netherlands']

salary
['<=50K' '>50K']
# Checking counts for each Country
df_updated['native-country'].value_counts()
USA             29735
Mexico            639
Philippines       198
Germany           137
Canada            121
PuertoRico        114
El-Salvador       106
India             100
Cuba               95
England            90
Jamaica            81
SouthKorea         80
China              75
Italy              73
DominicRep         70
Vietnam            67
Japan              62
Guatemala          62
Poland             60
Columbia           59
Taiwan             51
Haiti              44
Iran               43
Portugal           37
Nicaragua          34
Peru               31
France             29
Greece             29
Ecuador            28
Ireland            24
HongKong           20
Cambodia           19
Tri&Tob            19
Laos               18
Thailand           18
Yugoslavia         16
OutlyingUSA        14
Honduras           13
Hungary            13
Scotland           12
Netherlands         1
Name: native-country, dtype: int64
The USA has the highest number of records, around 90% of the dataset.
We chose to split the dataset into USA and non-USA subsets for EDA; since the USA dominates, splitting gives a clearer view of the distribution of records from the other countries.
# Splitting dataset into USA and Non-USA
USA = df_updated[df_updated['native-country'] == 'USA']
NonUSA = df_updated[df_updated['native-country'] != 'USA']
print('USA', USA.shape)
print('Non-USA', NonUSA.shape)
USA (29735, 15)
Non-USA (2802, 15)
# Perform EDA on the full dataset
perform_eda(df_updated, "Total")
# Perform EDA on USA data
perform_eda(USA, "USA")
# Perform EDA on Non-USA data
perform_eda(NonUSA, "Non-USA")
# Numerical Analysis (Age & Hours/Week)
plt.subplots(figsize=(15,10))
plt.subplot(2,2,1)
plt.title('Age of the Individual : Histogram',fontsize=16)
sns.histplot(df_updated['age'], bins=73, kde=True)  # distplot is deprecated in recent seaborn
plt.ylabel(None), plt.yticks([]), plt.xlabel(None)
plt.subplot(2,2,2)
plt.title('Hours / Week: Histogram', fontsize=16)
sns.histplot(df_updated['hours-per-week'], color='#40E0D0', bins=98, kde=True)
plt.ylabel(None), plt.yticks([]), plt.xlabel(None)
plt.subplot(2,2,3)
plt.title('Age of the Individual : Box & Whisker Plot', fontsize=16)
sns.boxplot(df_updated['age'], orient='h',color="#c7e9b4")
plt.subplot(2,2,4)
plt.title('Hours / Week: Box & Whisker Plot', fontsize=16)
sns.boxplot(df_updated['hours-per-week'], orient='h', color="#c7e9b4")
plt.show()
Using box & whisker plots and histograms to understand the distribution and identify outliers.
Graphs(df_updated)
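The whisker bounds in those box plots follow the 1.5×IQR rule, which can also be applied numerically to list the outliers; a sketch on hypothetical hours-per-week values:

```python
import pandas as pd

hours = pd.Series([40, 38, 45, 40, 40, 99, 40, 42, 1])

# Interquartile range and the standard 1.5*IQR whisker bounds.
q1, q3 = hours.quantile(0.25), hours.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything beyond the whiskers is flagged as an outlier.
outliers = hours[(hours < lower) | (hours > upper)]
print(outliers.tolist())
```

This gives the same points the box plot draws individually beyond its whiskers, but in a form that can feed further filtering if needed.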
We employed LabelEncoder to convert categorical values within our dataset into numerical representations. This technique assigns a unique integer to each category within a feature, enabling algorithms that require numerical input to process the data effectively. By applying LabelEncoder to all columns in the DataFrame, we've transformed the entire dataset into numerical form for further analysis and modeling.
# Using LabelEncoder to convert category values to ordinal codes
labelencoder = LabelEncoder()
for column in df_updated.columns:
    df_updated[column] = labelencoder.fit_transform(df_updated[column])
df_updated.head()
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 1 | 2671 | 9 | 12 | 2 | 0 | 1 | 4 | 1 | 25 | 0 | 39 | 38 | 0 |
| 1 | 33 | 3 | 2926 | 9 | 12 | 1 | 2 | 0 | 4 | 1 | 0 | 0 | 12 | 38 | 0 |
| 2 | 21 | 2 | 14086 | 11 | 8 | 0 | 4 | 1 | 4 | 1 | 0 | 0 | 39 | 38 | 0 |
| 3 | 36 | 2 | 15336 | 1 | 6 | 1 | 4 | 0 | 1 | 1 | 0 | 0 | 39 | 38 | 0 |
| 4 | 11 | 2 | 19355 | 9 | 12 | 1 | 8 | 5 | 1 | 0 | 0 | 0 | 39 | 4 | 0 |
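LabelEncoder imposes an arbitrary numeric ordering on nominal features such as workclass or race. For models sensitive to that ordering, the OneHotEncoder and ColumnTransformer imported earlier are the usual alternative; a minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Male", "Female"],
})

# One-hot encode the nominal column; pass numeric columns through unchanged.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["sex"])],
    remainder="passthrough",
)
encoded = ct.fit_transform(df)
print(encoded.shape)  # 3 rows: 2 one-hot columns for sex plus age
```

The trade-off is dimensionality: one-hot encoding native-country here would add around 40 columns, which is part of why label encoding was kept for this notebook.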
correlation_matrix = df_updated.corr().abs()
correlation_matrix
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.000000 | 0.058500 | 0.078184 | 0.010539 | 0.036244 | 0.475668 | 0.002769 | 0.263828 | 0.027600 | 0.088740 | 0.125907 | 0.065035 | 0.068878 | 0.000507 | 0.234131 |
| workclass | 0.058500 | 1.000000 | 0.021546 | 0.025152 | 0.070432 | 0.009118 | 0.061881 | 0.087260 | 0.077088 | 0.114771 | 0.037712 | 0.019732 | 0.096918 | 0.017298 | 0.026338 |
| fnlwgt | 0.078184 | 0.021546 | 1.000000 | 0.026957 | 0.042957 | 0.033027 | 0.008916 | 0.006958 | 0.039421 | 0.025985 | 0.004586 | 0.009901 | 0.019313 | 0.069017 | 0.010605 |
| education | 0.010539 | 0.025152 | 0.026957 | 1.000000 | 0.359085 | 0.002876 | 0.059741 | 0.011057 | 0.013969 | 0.027433 | 0.031448 | 0.016157 | 0.056784 | 0.077512 | 0.079366 |
| education-num | 0.036244 | 0.070432 | 0.042957 | 0.359085 | 1.000000 | 0.017983 | 0.058692 | 0.094432 | 0.029606 | 0.012205 | 0.154414 | 0.084144 | 0.150402 | 0.091344 | 0.335272 |
| marital-status | 0.475668 | 0.009118 | 0.033027 | 0.002876 | 0.017983 | 1.000000 | 0.014520 | 0.042539 | 0.017139 | 0.074297 | 0.047855 | 0.026197 | 0.110990 | 0.009382 | 0.106078 |
| occupation | 0.002769 | 0.061881 | 0.008916 | 0.059741 | 0.058692 | 0.014520 | 1.000000 | 0.134775 | 0.027019 | 0.195508 | 0.009115 | 0.002675 | 0.018043 | 0.004515 | 0.004927 |
| relationship | 0.263828 | 0.087260 | 0.006958 | 0.011057 | 0.094432 | 0.042539 | 0.134775 | 1.000000 | 0.123128 | 0.582594 | 0.093197 | 0.064319 | 0.251253 | 0.010479 | 0.250948 |
| race | 0.027600 | 0.077088 | 0.039421 | 0.013969 | 0.029606 | 0.017139 | 0.027019 | 0.123128 | 1.000000 | 0.096192 | 0.027443 | 0.017728 | 0.046902 | 0.119721 | 0.072093 |
| sex | 0.088740 | 0.114771 | 0.025985 | 0.027433 | 0.012205 | 0.074297 | 0.195508 | 0.582594 | 0.096192 | 1.000000 | 0.077602 | 0.049550 | 0.231232 | 0.000828 | 0.215969 |
| capital-gain | 0.125907 | 0.037712 | 0.004586 | 0.031448 | 0.154414 | 0.047855 | 0.009115 | 0.093197 | 0.027443 | 0.077602 | 1.000000 | 0.057015 | 0.101346 | 0.013882 | 0.340019 |
| capital-loss | 0.065035 | 0.019732 | 0.009901 | 0.016157 | 0.084144 | 0.026197 | 0.002675 | 0.064319 | 0.017728 | 0.049550 | 0.057015 | 1.000000 | 0.058805 | 0.010404 | 0.162494 |
| hours-per-week | 0.068878 | 0.096918 | 0.019313 | 0.056784 | 0.150402 | 0.110990 | 0.018043 | 0.251253 | 0.046902 | 0.231232 | 0.101346 | 0.058805 | 1.000000 | 0.006495 | 0.232365 |
| native-country | 0.000507 | 0.017298 | 0.069017 | 0.077512 | 0.091344 | 0.009382 | 0.004515 | 0.010479 | 0.119721 | 0.000828 | 0.013882 | 0.010404 | 0.006495 | 1.000000 | 0.023652 |
| salary | 0.234131 | 0.026338 | 0.010605 | 0.079366 | 0.335272 | 0.106078 | 0.004927 | 0.250948 | 0.072093 | 0.215969 | 0.340019 | 0.162494 | 0.232365 | 0.023652 | 1.000000 |
plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
clean_data = remove_correlated_inputs(df_updated.copy())
clean_data.head(10)
| age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 1 | 2671 | 9 | 12 | 2 | 0 | 1 | 4 | 1 | 25 | 0 | 39 | 38 | 0 |
| 1 | 33 | 3 | 2926 | 9 | 12 | 1 | 2 | 0 | 4 | 1 | 0 | 0 | 12 | 38 | 0 |
| 2 | 21 | 2 | 14086 | 11 | 8 | 0 | 4 | 1 | 4 | 1 | 0 | 0 | 39 | 38 | 0 |
| 3 | 36 | 2 | 15336 | 1 | 6 | 1 | 4 | 0 | 1 | 1 | 0 | 0 | 39 | 38 | 0 |
| 4 | 11 | 2 | 19355 | 9 | 12 | 1 | 8 | 5 | 1 | 0 | 0 | 0 | 39 | 4 | 0 |
| 5 | 20 | 2 | 17700 | 12 | 13 | 1 | 2 | 5 | 4 | 0 | 0 | 0 | 39 | 38 | 0 |
| 6 | 32 | 2 | 8536 | 6 | 4 | 0 | 7 | 1 | 1 | 0 | 0 | 0 | 15 | 21 | 0 |
| 7 | 35 | 3 | 13620 | 11 | 8 | 1 | 2 | 0 | 4 | 1 | 0 | 0 | 44 | 38 | 1 |
| 8 | 14 | 2 | 1318 | 12 | 13 | 2 | 8 | 1 | 4 | 0 | 105 | 0 | 49 | 38 | 1 |
| 9 | 25 | 2 | 8460 | 9 | 12 | 1 | 2 | 0 | 4 | 1 | 79 | 0 | 39 | 38 | 1 |
This script takes an additional argument, threshold, the correlation level above which a feature is removed; here it is 0.9. The cleaned data is saved to a new CSV file named 'clean_data.csv'. Removing correlated features can lead to loss of information, so it is important to understand the trade-offs. From the correlation chart and our data visualisations, no features in this dataset are so heavily correlated that they must be removed, but the check is still useful for validating the accuracy of our results.
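A minimal sketch of how such threshold-based removal typically works (the helper name and logic here are assumptions for illustration, not the module's actual code): take the absolute correlation matrix, keep only its upper triangle so each pair is inspected once, and drop any column correlated above the threshold with an earlier one.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop numeric columns whose absolute pairwise correlation exceeds threshold."""
    corr = df.corr().abs()
    # Upper triangle only (k=1 excludes the diagonal), so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Hypothetical frame: 'b' is perfectly correlated with 'a' and should be dropped.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
print(list(drop_correlated(df).columns))
```

Keeping the first of each correlated pair (rather than the second) is an arbitrary but common convention.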
normalisation(clean_data)
# Load the data from a CSV file
clean_data = pd.read_csv("normalized_data.csv")
print(clean_data.head(10))
age workclass fnlwgt education education-num marital-status \ 0 0.030778 -1.962630 -1.294106 -0.335437 1.134739 1.218073 1 0.837509 2.053408 -1.251950 -0.335437 1.134739 -0.161128 2 -0.042561 0.045389 0.593020 0.181332 -0.420060 -1.540329 3 1.057526 0.045389 0.799670 -2.402511 -1.197459 -0.161128 4 -0.775952 0.045389 1.464090 -0.335437 1.134739 -0.161128 5 -0.115900 0.045389 1.190486 0.439716 1.523438 -0.161128 6 0.764169 0.045389 -0.324505 -1.110590 -1.974858 -1.540329 7 0.984187 2.053408 0.515981 0.181332 -0.420060 -0.161128 8 -0.555934 0.045389 -1.517784 0.439716 1.523438 1.218073 9 0.250796 0.045389 -0.337069 -0.335437 1.134739 -0.161128 occupation relationship race sex capital-gain capital-loss \ 0 -1.718121 -0.277805 0.400252 0.703071 0.793942 -0.204177 1 -1.207608 -0.900181 0.400252 0.703071 -0.279023 -0.204177 2 -0.697096 -0.277805 0.400252 0.703071 -0.279023 -0.204177 3 -0.697096 -0.900181 -2.310923 0.703071 -0.279023 -0.204177 4 0.323929 2.211698 -2.310923 -1.422331 -0.279023 -0.204177 5 -1.207608 2.211698 0.400252 -1.422331 -0.279023 -0.204177 6 0.068672 -0.277805 -2.310923 -1.422331 -0.279023 -0.204177 7 -1.207608 -0.900181 0.400252 0.703071 -0.279023 -0.204177 8 0.323929 -0.277805 0.400252 -1.422331 4.227429 -0.204177 9 -1.207608 -0.900181 0.400252 0.703071 3.111545 -0.204177 hours-per-week native-country salary 0 -0.031122 0.263562 -0.563199 1 -2.254475 0.263562 -0.563199 2 -0.031122 0.263562 -0.563199 3 -0.031122 0.263562 -0.563199 4 -0.031122 -5.281733 -0.563199 5 -0.031122 0.263562 -0.563199 6 -2.007436 -2.509086 -0.563199 7 0.380610 0.263562 1.775573 8 0.792342 0.263562 1.775573 9 -0.031122 0.263562 1.775573
This code reads a CSV into a pandas DataFrame, normalises numeric columns with the z-score so each has mean 0 and standard deviation 1, and saves the result to a new CSV. Standardisation can improve machine learning model performance and make outliers easier to detect, although it does not make the data Gaussian. However, we later opted for a less intrusive scaling approach, as our data had no problematic outliers.
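A minimal sketch of the z-score step, assuming the normalisation module works along these lines (subtract each column's mean, divide by its standard deviation):

```python
import pandas as pd

# A few real fnlwgt values from the head of the dataset.
df = pd.DataFrame({"fnlwgt": [77516.0, 83311.0, 215646.0, 234721.0]})

# Z-score: centre on the column mean, scale by the standard deviation.
z = (df - df.mean()) / df.std()
print(z["fnlwgt"].tolist())
```

After this transform every column has mean 0 and standard deviation 1, which is why the normalised fnlwgt values printed above sit in a small range around zero instead of spanning tens of thousands.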
Both normalisation and removal of correlated inputs were considered, but neither was applied to the final dataset used for modelling.
This SVC model is the comparison model for logistic regression. It was introduced to test whether another model could address the problem statement successfully.
Support Vector Classification models excel at classifying high-dimensional data. They perform well on data with a large number of features but are computationally very expensive. GridSearchCV let us test a range of hyperparameters to find the most accurate combination, though this took a long time to train, as the timing logs show. The excessively high accuracy could mean the model is a very good choice for this dataset, or that overfitting has occurred. To try to avoid overfitting, the data was scaled as discussed in a later cell; this had no noticeable impact on the model's accuracy.
X = df_updated.drop('salary', axis=1) # Features
y = df_updated['salary'] # Target variable
# Split features and target; passing the full DataFrame here would leak the 'salary' target into the inputs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
svm_model = SVC(C=10, kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)
# Evaluate the model without best Params
y_pred = svm_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
cm_new = confusion_matrix(y_test, y_pred)
# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()
# Grid search over the regularisation strength C for a linear-kernel SVC
param_grid = {
    'C': [0.001, 0.01, 0.1, 1],
    'kernel': ['linear']
}
grid_search = GridSearchCV(SVC(), param_grid, verbose=2)
grid_search.fit(X_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
# Use the best estimator to make predictions
y_pred = grid_search.predict(X_test)
# Evaluate the best model found by grid search
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
cm_new = confusion_matrix(y_test, y_pred)
# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()
# Model is still overfitted.
# Use cross_val_score (imported above) for cross-validation
svm_model = SVC(kernel='linear', C=1, random_state=42)
cv_scores = cross_val_score(svm_model, X_train, y_train, cv=5) # Adjust the number of folds as needed
# Print average cross-validation accuracy
print("Average Cross-Validation Accuracy:", np.mean(cv_scores))
Accuracy: 0.9996926859250154
precision recall f1-score support
0 1.00 1.00 1.00 4905
1 1.00 1.00 1.00 1603
accuracy 1.00 6508
macro avg 1.00 1.00 1.00 6508
weighted avg 1.00 1.00 1.00 6508
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] END .............................C=0.001, kernel=linear; total time= 9.0min
[CV] END .............................C=0.001, kernel=linear; total time=14.1min
[CV] END .............................C=0.001, kernel=linear; total time=63.9min
[CV] END .............................C=0.001, kernel=linear; total time=10.9min
[CV] END .............................C=0.001, kernel=linear; total time= 9.4min
[CV] END ..............................C=0.01, kernel=linear; total time= 2.3min
[CV] END ..............................C=0.01, kernel=linear; total time= 4.2min
[CV] END ..............................C=0.01, kernel=linear; total time= 3.5min
[CV] END ..............................C=0.01, kernel=linear; total time= 2.6min
[CV] END ..............................C=0.01, kernel=linear; total time= 3.8min
[CV] END ...............................C=0.1, kernel=linear; total time= 3.1min
[CV] END ...............................C=0.1, kernel=linear; total time= 2.5min
[CV] END ...............................C=0.1, kernel=linear; total time= 2.6min
[CV] END ...............................C=0.1, kernel=linear; total time= 4.5min
[CV] END ...............................C=0.1, kernel=linear; total time= 3.4min
[CV] END .................................C=1, kernel=linear; total time= 2.5min
[CV] END .................................C=1, kernel=linear; total time= 3.8min
[CV] END .................................C=1, kernel=linear; total time= 5.2min
[CV] END .................................C=1, kernel=linear; total time= 3.1min
[CV] END .................................C=1, kernel=linear; total time= 3.1min
Best parameters found: {'C': 0.001, 'kernel': 'linear'}
Accuracy: 1.0
precision recall f1-score support
0 1.00 1.00 1.00 4905
1 1.00 1.00 1.00 1603
accuracy 1.00 6508
macro avg 1.00 1.00 1.00 6508
weighted avg 1.00 1.00 1.00 6508
Average Cross-Validation Accuracy: 0.9986169582647377
These near-perfect results suggest the model may be overfitting, or that some feature is effectively leaking the target. Examining the data:
df_updated.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 1 | 2671 | 9 | 12 | 2 | 0 | 1 | 4 | 1 | 25 | 0 | 39 | 38 | 0 |
| 1 | 33 | 3 | 2926 | 9 | 12 | 1 | 2 | 0 | 4 | 1 | 0 | 0 | 12 | 38 | 0 |
| 2 | 21 | 2 | 14086 | 11 | 8 | 0 | 4 | 1 | 4 | 1 | 0 | 0 | 39 | 38 | 0 |
| 3 | 36 | 2 | 15336 | 1 | 6 | 1 | 4 | 0 | 1 | 1 | 0 | 0 | 39 | 38 | 0 |
| 4 | 11 | 2 | 19355 | 9 | 12 | 1 | 8 | 5 | 1 | 0 | 0 | 0 | 39 | 4 | 0 |
This shows that the data could be scaled better, particularly the fnlwgt feature. Its much larger range of values could be dominating the other features and contributing to the odd performance of the model. The code below scales the data in an attempt to prevent overfitting:
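Before scaling, the imbalance can be confirmed by comparing per-feature ranges. A small sketch using a few of the encoded values from the table above (a toy dataframe, not the full dataset):

```python
# Toy sketch: comparing per-feature ranges shows how one wide-range
# column such as fnlwgt can dominate the others before scaling.
import pandas as pd

toy = pd.DataFrame({
    'age': [22, 33, 21, 36],
    'fnlwgt': [2671, 2926, 14086, 15336],
    'hours-per-week': [39, 12, 39, 39],
})
ranges = toy.max() - toy.min()  # range of each feature
print(ranges.sort_values(ascending=False))
```

On this sample, fnlwgt's range (12665) is two orders of magnitude larger than the other features', which is the kind of imbalance standardisation addresses.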
# The scaling step was added after the model achieved 99.8% accuracy, pointing to possible overfitting.
# High values in the dataset may be the reason for this. The code below scales the values in the dataset.
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid leaking test statistics.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Grid search over SVC hyperparameters
param_grid = {
'C': [0.001, 0.1, 1],
'kernel': ['linear']
}
grid_search = GridSearchCV(SVC(), param_grid, verbose=2)
grid_search.fit(X_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
# Use the best estimator to make predictions
y_pred = grid_search.predict(X_test)
# Evaluate the best model found by grid search
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
cm_new = confusion_matrix(y_test, y_pred)
# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()
# Model is still overfitted.
from sklearn.model_selection import cross_val_score
# Use cross_val_score for cross-validation
svm_model = SVC(kernel='linear', C=1, random_state=42)
cv_scores = cross_val_score(svm_model, X_train, y_train, cv=5) # Adjust the number of folds as needed
# Print average cross-validation accuracy
print("Average Cross-Validation Accuracy:", np.mean(cv_scores))
Accuracy: 0.9996926859250154
precision recall f1-score support
0 1.00 1.00 1.00 4905
1 1.00 1.00 1.00 1603
accuracy 1.00 6508
macro avg 1.00 1.00 1.00 6508
weighted avg 1.00 1.00 1.00 6508
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] END .............................C=0.001, kernel=linear; total time= 9.1min
[CV] END .............................C=0.001, kernel=linear; total time=14.2min
[CV] END .............................C=0.001, kernel=linear; total time=12.0min
[CV] END .............................C=0.001, kernel=linear; total time=11.1min
[CV] END .............................C=0.001, kernel=linear; total time= 9.4min
[CV] END ...............................C=0.1, kernel=linear; total time= 3.1min
[CV] END ...............................C=0.1, kernel=linear; total time= 2.5min
[CV] END ...............................C=0.1, kernel=linear; total time= 2.4min
[CV] END ...............................C=0.1, kernel=linear; total time= 4.5min
[CV] END ...............................C=0.1, kernel=linear; total time= 3.4min
[CV] END .................................C=1, kernel=linear; total time= 2.4min
[CV] END .................................C=1, kernel=linear; total time= 3.9min
[CV] END .................................C=1, kernel=linear; total time= 5.3min
[CV] END .................................C=1, kernel=linear; total time= 3.1min
[CV] END .................................C=1, kernel=linear; total time= 3.1min
Best parameters found: {'C': 0.001, 'kernel': 'linear'}
Accuracy: 1.0
precision recall f1-score support
0 1.00 1.00 1.00 4905
1 1.00 1.00 1.00 1603
accuracy 1.00 6508
macro avg 1.00 1.00 1.00 6508
weighted avg 1.00 1.00 1.00 6508
Average Cross-Validation Accuracy: 0.9986169582647377
Logistic regression was the main approach used to address the problem statement, and it is evaluated below.
Logistic regression is a faster machine learning method to train than SVC. Given the size of our dataset, it made sense to train a logistic regression model first due to time constraints. Logistic regression was also selected because it trains efficiently on high-dimensional data with a large number of features. The dataset appeared to be linearly separable, making logistic regression a good choice. Although the model is simpler than a support vector machine, it successfully learned from the dataset, as the confusion matrix shows.
On the first run of the model the accuracy is 81.1%; we will now explore how to improve it.
X = df_updated.drop('salary', axis=1) # Features
y = df_updated['salary'] # Target variable
# Correct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train) # Training the model
# Classification Report and Accuracy
y_pred = clf.predict(X_test) # Testing the model used the test data
cm_new = confusion_matrix(y_test, y_pred)
# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()
print(classification_report(y_test, y_pred)) # Printing precision recall and other evaluation metrics
print(f"Training Accuracy: {clf.score(X_train, y_train):.3f}")
print(f"Testing Accuracy: {clf.score(X_test, y_test):.3f}")
precision recall f1-score support
0 0.83 0.95 0.89 4942
1 0.71 0.39 0.50 1571
accuracy 0.81 6513
macro avg 0.77 0.67 0.69 6513
weighted avg 0.80 0.81 0.79 6513
Training Accuracy: 0.809
Testing Accuracy: 0.813
The ROC curve visualizes the trade-off between true positive rate (TPR, correctly classified positives) and false positive rate (FPR, incorrectly classified negatives) at different classification thresholds. The AUC score summarizes this performance across all thresholds, with a higher AUC (closer to 1) indicating better discrimination between classes (a random classifier would have an AUC of 0.5). This helps evaluate the logistic regression model's ability to distinguish between the two classes based on TPR (correctly identifying positives).
# Visualization 2: ROC Curve
y_prob = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = {:.2f})'.format(roc_auc_score(y_test, y_prob)))
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
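As a complementary illustration (not part of the original notebook), the AUC has a simple pairwise interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A sketch with a hypothetical helper `pairwise_auc` and made-up scores:

```python
# Compute the AUC by direct pair counting: the fraction of
# (positive, negative) pairs where the positive is scored higher
# (ties count as half a win).
import numpy as np

def pairwise_auc(y_true, y_score):
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

y_true = np.array([0, 0, 1, 1])
print(pairwise_auc(y_true, np.array([0.1, 0.4, 0.35, 0.8])))  # 0.75: one pair misranked
print(pairwise_auc(y_true, np.array([0.1, 0.2, 0.8, 0.9])))   # 1.0: perfect ranking
```

This matches what `roc_auc_score` computes, and makes concrete why an AUC of 0.5 corresponds to random ranking.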
As we saw during data pre-processing when testing the correlation between input features, the two most correlated features were relationship and sex, with a correlation of 0.58. Even though this is not very high, we can see how the model performs after removing one of them. Below we drop relationship from the dataset and re-run the model.
As we can see, removing 'relationship' lowered the model's performance, so this is not an ideal technique.
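The correlation check from pre-processing can be sketched as follows, on made-up data (the real check on df_updated is not reproduced here; the 0.58 threshold mirrors the observed relationship/sex correlation):

```python
# Hypothetical sketch: flag feature pairs whose absolute correlation
# exceeds a chosen threshold, as candidates for removal.
import pandas as pd

toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],   # perfectly correlated with 'a'
    'c': [5, 3, 8, 1, 9],    # only weakly correlated with the others
})
corr = toy.corr()
threshold = 0.58
pairs = [(i, j) for i in corr.columns for j in corr.columns
         if i < j and abs(corr.loc[i, j]) > threshold]
print(pairs)  # [('a', 'b')]
```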
X = df_updated.drop(['relationship', 'salary'], axis=1) # Features
y = df_updated['salary'] # Target variable
# Correct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train) # Training the model
# Classification Report and Accuracy
y_pred = clf.predict(X_test)
cm_new = confusion_matrix(y_test, y_pred)
# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()
print(classification_report(y_test, y_pred))
print(f"Training Accuracy: {clf.score(X_train, y_train):.3f}")
print(f"Testing Accuracy: {clf.score(X_test, y_test):.3f}")
precision recall f1-score support
0 0.83 0.94 0.88 4942
1 0.68 0.40 0.50 1571
accuracy 0.81 6513
macro avg 0.76 0.67 0.69 6513
weighted avg 0.79 0.81 0.79 6513
Training Accuracy: 0.807
Testing Accuracy: 0.810
As we can see, removing 'sex' did not improve the model's performance either, so this is not an ideal technique.
X = df_updated.drop(['sex', 'salary'], axis=1) # Features
y = df_updated['salary'] # Target variable
# Correct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train) # Training the model
# Classification Report and Accuracy
y_pred = clf.predict(X_test)
cm_new = confusion_matrix(y_test, y_pred)
# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()
print(classification_report(y_test, y_pred))
print(f"Training Accuracy: {clf.score(X_train, y_train):.3f}")
print(f"Testing Accuracy: {clf.score(X_test, y_test):.3f}")
precision recall f1-score support
0 0.83 0.95 0.89 4942
1 0.70 0.39 0.50 1571
accuracy 0.81 6513
macro avg 0.77 0.67 0.69 6513
weighted avg 0.80 0.81 0.79 6513
Training Accuracy: 0.809
Testing Accuracy: 0.813
As part of optimising the model's performance after the first run, parameter tuning was used as shown below. This approach adjusts the model's parameters, such as the C value and the maximum number of iterations, to find the values that most improve performance and make the model more effective at meeting our problem statement.
To speed up the tuning process, grid search is used. This technique finds the optimal parameters for us without manual trial and error: we supply candidate values for each parameter, and it evaluates every combination and reports the best. As the model accuracy below shows, hyperparameter tuning slightly improved the model's performance.
# Features and Target variable
X = df_updated.drop('salary', axis=1)
y = df_updated['salary']
# Correct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression with GridSearchCV for parameter tuning
param_grid = {
'C': [0.001, 0.01, 0.1, 1, 10, 100, 500, 1000],
'max_iter': [1000, 1200, 1300, 1400, 1500]
}
clf = LogisticRegression(max_iter=10000, random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
# Train the model with the best parameters
best_clf = LogisticRegression(
C=best_params['C'],
max_iter=best_params['max_iter'],
random_state=42
)
best_clf.fit(X_train, y_train)
# Classification Report and Accuracy
y_pred = best_clf.predict(X_test)
cm_new = confusion_matrix(y_test, y_pred)
# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()
print("Classification Report:")
print(classification_report(y_test, y_pred))
print(f"Training Accuracy: {best_clf.score(X_train, y_train):.3f}")
print(f"Testing Accuracy: {best_clf.score(X_test, y_test):.3f}")
Best Parameters: {'C': 100, 'max_iter': 1000}
Classification Report:
precision recall f1-score support
0 0.83 0.95 0.89 4942
1 0.70 0.40 0.51 1571
accuracy 0.81 6513
macro avg 0.77 0.67 0.70 6513
weighted avg 0.80 0.81 0.79 6513
Training Accuracy: 0.810
Testing Accuracy: 0.814
By applying StandardScaler to logistic regression the accuracy increased from 0.813 to 0.824. This happens because logistic regression relies on linear relationships between features and the target variable; when features have very different scales, it can be harder for the model to learn these relationships effectively. Standardisation addresses this by transforming each feature to have a mean of 0 and a standard deviation of 1. This puts all features on an equal footing, allowing the model to focus on the underlying patterns in the data and leading to more accurate predictions of the target variable.
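The mean-0/std-1 property can be verified numerically on made-up values (a minimal sketch, independent of the dataset):

```python
# Standardisation: z = (x - mean) / std, which by construction gives
# the transformed values mean approximately 0 and standard deviation 1.
import numpy as np

x = np.array([25.0, 40.0, 55.0, 80.0])  # e.g. a wide-range feature
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # approximately 0 and 1
```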
X = df_updated.drop('salary', axis=1) # Features
y = df_updated['salary'] # Target variable
# Correct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale the features
X_train_scaled, X_test_scaled = scale_the_data_RG_29014027.scale_the_data(X_train, X_test)
# Logistic Regression
clf = LogisticRegression(C=1000, max_iter=1300)
clf.fit(X_train_scaled, y_train) # Training the model
# Classification Report and Accuracy
y_pred = clf.predict(X_test_scaled)
cm_new = confusion_matrix(y_test, y_pred)
# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()
print(classification_report(y_test, y_pred))
print(f"Training Accuracy: {clf.score(X_train_scaled, y_train):.3f}")
print(f"Testing Accuracy: {clf.score(X_test_scaled, y_test):.3f}")
precision recall f1-score support
0 0.84 0.94 0.89 4942
1 0.71 0.46 0.56 1571
accuracy 0.82 6513
macro avg 0.78 0.70 0.72 6513
weighted avg 0.81 0.82 0.81 6513
Training Accuracy: 0.823
Testing Accuracy: 0.824
As we saw, parameter tuning raised the testing accuracy to 0.814, a small improvement. Next, cross-validation is conducted using C=1000 and max_iter=1300.
Once suitable parameters were found, cross-validation was used as an alternative to a single train_test_split to see whether the model could be improved further. The k-fold technique splits the dataset into k equally sized folds; one fold is held out for testing while the remaining k-1 folds are used for training, and the process runs k times so that each fold is used exactly once as the test data. As shown below, this was repeated with different numbers of folds. The experiment showed that k-fold cross-validation did not improve the model's performance.
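The fold mechanics can be illustrated on a toy array (a sketch separate from the notebook's full cross-validation function): across the k splits, every sample index lands in the test fold exactly once.

```python
# Toy illustration of KFold: with 10 samples and 5 splits, each split
# holds out 2 samples, and the union of test folds covers every index once.
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
test_indices = []
for train_idx, test_idx in kf.split(X_toy):
    test_indices.extend(test_idx)
print(sorted(test_indices))  # every index 0..9 appears exactly once
```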
# Features and Target variable
X = df_updated.drop('salary', axis=1)
y = df_updated['salary']
# Logistic Regression with specified parameters
clf = LogisticRegression(C= 1000, max_iter= 1300, random_state=42)
def crossValFunction(numOfFolds):
    # Perform cross-validation with the requested number of folds
    folds_define = KFold(random_state=42, shuffle=True, n_splits=numOfFolds)
    # Lists to store results
    y_test_results = []
    y_pred_results = []
    score_list = []  # Accuracy score for each fold
    # Iterate through the folds
    for train_index, test_index in folds_define.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf.fit(X_train, y_train)  # Train on this fold's training split
        y_fold_pred = clf.predict(X_test)  # Predict on the held-out fold
        fold_accuracy = accuracy_score(y_test, y_fold_pred)
        score_list.append(fold_accuracy)
        fold_confusion_matrix = confusion_matrix(y_test, y_fold_pred)
        # Append results to the overall lists
        y_test_results.extend(y_test)
        y_pred_results.extend(y_fold_pred)
        print(f"\nFold Accuracy: {fold_accuracy:.4f}")
        print(f"Fold Confusion Matrix:\n{fold_confusion_matrix}")
    # Evaluate the overall model performance using all the data
    overall_accuracy = accuracy_score(y_test_results, y_pred_results)
    overall_confusion_matrix = confusion_matrix(y_test_results, y_pred_results)
    print(f"\nOverall Accuracy for all folds: {overall_accuracy:.4f}")
    print("\nOverall Confusion Matrix for all folds is:\n", overall_confusion_matrix)
    # Plot the boxplot of accuracy scores
    plt.boxplot(score_list)
    plt.title('Accuracy Distribution Across Folds')
    plt.ylabel('Accuracy')
    plt.show()
    # Plot the overall confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(overall_confusion_matrix, annot=True, fmt='d', cmap='Reds',
                xticklabels=np.unique(y_test_results), yticklabels=np.unique(y_test_results))
    plt.title('Confusion Matrix')
    plt.xlabel('y_pred')
    plt.ylabel('y_test')
    plt.show()
crossValFunction(7)
Fold Accuracy: 0.8100  Fold Confusion Matrix: [[3334 181] [ 703 434]]
Fold Accuracy: 0.8177  Fold Confusion Matrix: [[3394 175] [ 673 410]]
Fold Accuracy: 0.8070  Fold Confusion Matrix: [[3320 208] [ 690 434]]
Fold Accuracy: 0.8119  Fold Confusion Matrix: [[3343 195] [ 680 434]]
Fold Accuracy: 0.8063  Fold Confusion Matrix: [[3330 192] [ 709 420]]
Fold Accuracy: 0.8078  Fold Confusion Matrix: [[3333 194] [ 700 424]]
Fold Accuracy: 0.8138  Fold Confusion Matrix: [[3312 209] [ 657 473]]
Overall Accuracy for all folds: 0.8106
Overall Confusion Matrix for all folds is: [[23366 1354] [ 4812 3029]]
crossValFunction(10)
Fold Accuracy: 0.8096  Fold Confusion Matrix: [[2316 140] [ 480 321]]
Fold Accuracy: 0.8160  Fold Confusion Matrix: [[2351 135] [ 464 306]]
Fold Accuracy: 0.8179  Fold Confusion Matrix: [[2378 135] [ 458 285]]
Fold Accuracy: 0.8056  Fold Confusion Matrix: [[2305 135] [ 498 318]]
Fold Accuracy: 0.8050  Fold Confusion Matrix: [[2324 137] [ 498 297]]
Fold Accuracy: 0.8034  Fold Confusion Matrix: [[2348 135] [ 505 268]]
Fold Accuracy: 0.8090  Fold Confusion Matrix: [[2353 137] [ 485 281]]
Fold Accuracy: 0.8102  Fold Confusion Matrix: [[2315 132] [ 486 323]]
Fold Accuracy: 0.8084  Fold Confusion Matrix: [[2317 161] [ 463 315]]
Fold Accuracy: 0.8188  Fold Confusion Matrix: [[2335 131] [ 459 331]]
Overall Accuracy for all folds: 0.8104
Overall Confusion Matrix for all folds is: [[23342 1378] [ 4796 3045]]
crossValFunction(20)
Fold Accuracy: 0.8109  Fold Confusion Matrix: [[1159 71] [ 237 162]]
Fold Accuracy: 0.8133  Fold Confusion Matrix: [[1162 64] [ 240 162]]
Fold Accuracy: 0.8077  Fold Confusion Matrix: [[1165 62] [ 251 150]]
Fold Accuracy: 0.8206  Fold Confusion Matrix: [[1183 76] [ 216 153]]
Fold Accuracy: 0.8268  Fold Confusion Matrix: [[1204 68] [ 214 142]]
Fold Accuracy: 0.8084  Fold Confusion Matrix: [[1174 67] [ 245 142]]
Fold Accuracy: 0.8139  Fold Confusion Matrix: [[1180 51] [ 252 145]]
Fold Accuracy: 0.8034  Fold Confusion Matrix: [[1133 76] [ 244 175]]
Fold Accuracy: 0.8041  Fold Confusion Matrix: [[1167 73] [ 246 142]]
Fold Accuracy: 0.8059  Fold Confusion Matrix: [[1155 66] [ 250 157]]
Fold Accuracy: 0.8243  Fold Confusion Matrix: [[1184 72] [ 214 158]]
Fold Accuracy: 0.7955  Fold Confusion Matrix: [[1156 71] [ 262 139]]
Fold Accuracy: 0.7961  Fold Confusion Matrix: [[1161 91] [ 241 135]]
Fold Accuracy: 0.8016  Fold Confusion Matrix: [[1161 77] [ 246 144]]
Fold Accuracy: 0.8145  Fold Confusion Matrix: [[1167 60] [ 242 159]]
Fold Accuracy: 0.8065  Fold Confusion Matrix: [[1146 74] [ 241 167]]
Fold Accuracy: 0.8163  Fold Confusion Matrix: [[1178 74] [ 225 151]]
Fold Accuracy: 0.7973  Fold Confusion Matrix: [[1143 83] [ 247 155]]
Fold Accuracy: 0.8077  Fold Confusion Matrix: [[1155 72] [ 241 160]]
Fold Accuracy: 0.8329  Fold Confusion Matrix: [[1180 59] [ 213 176]]
Overall Accuracy for all folds: 0.8104
Overall Confusion Matrix for all folds is: [[23313 1407] [ 4767 3074]]
Comparing the two models, the SVC achieved a far higher accuracy rate, but its near-perfect scores suggest it was overfitted; the Logistic Regression model was less likely to overfit when training on the data and still performed well in this use case.
We have shown that it is possible to predict with reasonable accuracy whether a given person makes over 50,000 per year based on several factors. A Logistic Regression model was an acceptable choice for this, achieving 82.4% accuracy on a cleaned and prepared dataset. Future work could include experimenting with machine learning models specialised for linearly separable datasets, such as a linear SVM optimised specifically for the linear kernel. This would reduce training time, allowing us to experiment with more C values.
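One candidate for the linear SVM suggested above is scikit-learn's LinearSVC, a liblinear-based implementation restricted to the linear kernel (an assumption about future work, not something run in this notebook). A minimal sketch on synthetic, linearly separable data:

```python
# Sketch: LinearSVC on two well-separated Gaussian blobs, a toy
# stand-in for a linearly separable classification problem.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(42)
X = np.vstack([rng.randn(50, 2) + [2, 2],   # class 1 blob
               rng.randn(50, 2) - [2, 2]])  # class 0 blob
y = np.array([1] * 50 + [0] * 50)

model = LinearSVC(C=1.0, max_iter=10000)
model.fit(X, y)
print(model.score(X, y))  # close to 1.0 on this separable toy data
```

Because LinearSVC avoids the kernel machinery entirely, a grid search over C values would run much faster than the multi-minute SVC fits logged earlier.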
As a group we held regular weekly meetings at a suitable time, established a clear communication plan, brainstormed solutions, and delegated tasks. Our team worked extremely well together. Following each call, each member received a specific task to complete before the next meeting. This focused approach kept everyone accountable and ensured steady progress. Finally, we merged everyone's contributions into a final document, reflecting the equal effort and expertise each member brought to this project. This collaborative approach resulted in completing the coursework before the deadline, and everyone in the group was satisfied with the overall result.
Where individuals were struggling or progress was slow, the team worked together to patch up and overcome the issue collectively. This showed that we worked effectively as a team and took ownership of the project. We sometimes split into smaller groups to work on more focused areas to ensure fast progress, which worked very effectively and meant more enhancements could be made to our model. Overall, the team was very happy with the teamwork and enthusiasm shown by all members.
We distributed roles within the group while discussing the nature of the tasks we wanted to undertake and the dataset we were using. This involved discussing each of the segments in the specification, such as preprocessing and building the model, and assigning them in an order that accounted for dependencies, as some tasks needed to be completed before others could start. Overall we successfully distributed the work so that everyone had something to contribute throughout the project.
RR 100%, RG 100%, HS 100%, KS 100%, AS 100%
I took a lead role in driving the project forward, handling a substantial portion of the work. This included data preprocessing, exploratory data analysis (EDA), and building the logistic regression model. Additionally, I ensured clear communication and collaboration by structuring a working notebook and merging everyone's contributions into a cohesive whole.
I worked on Gaussian distribution and removing correlated inputs. The challenges I faced included researching different standardisation methods to find which would fit best. By the end I was able to produce a working module, though I made the decision not to use my own standardisation methods as they were too intrusive to our datatypes.
I used a new external package (Plotly) for part of the EDA, which explored the distribution of data between males and females and their salary distributions. The challenge was using a library not taught in lectures, so to learn Plotly I referred to the documentation and other online resources.
The SVM model with a linear kernel was used on the dataset as it excels at making predictions on linearly separable data. Issues with this include linear kernels being more susceptible to outliers in the dataset and the model being likely to overfit. Learning about the GridSearchCV module saved a lot of time, automating the sequential trial of different C values.
Overall, this was a great learning experience for me. As part of my work, I created a module file to simplify the checking of duplicate values for EDA. My main responsibilities included helping to come up with innovative pre-processing techniques and developing the cross-validation and parameter tuning for the logistic regression model. One challenge was finding the most optimal parameters for our model; through research, a technique known as grid search was found to simplify the process. I believe this coursework has allowed me to understand the data science process and enhance my programming knowledge and skills, which I intend to build on further in later studies.